{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# 11.ScrapySpider Scrapy爬虫框架\n",
    "\n",
    "在上一章我们了解了pyspider框架的用法，我们可以利用它快速完成爬虫的编写\n",
    "\n",
    "不过 pyspider框架也有一些缺点，比如可配置化程度不高，异常处理能力有限等，\n",
    "它对于一些反爬程度非常强的网站的爬取显得力不从心\n",
    "\n",
    "所以本章我们再介绍一个爬虫框架 Scrapy\n",
    "\n",
    "Scrapy 功能非常强大，爬取效率高，相关扩展组件多，可配置和可扩展程度非常高，它几乎可以应对所有反爬网站，是目前Python使用最广泛的爬虫框架\n",
    "\n",
    "## 1. Introduction介绍\n",
    "\n",
    "Scrapy是基于 Twisted 的异步处理框架，是纯 Python 实现的爬虫框架，其架构清晰，\n",
    "模块之间的耦合程度低，可扩展性极强，可以灵活完成各种需求\n",
    "\n",
    "我们只需要定制开发几个模块就可以轻松实现一个爬虫\n",
    "\n",
    "下图是它的框架架构：\n",
    "![](ScrapyFrame.JPG)\n",
    "\n",
    "它可以分成几个部分：\n",
    "\n",
    "**engine\t引擎，类似于一个中间件，负责控制数据流在系统中的所有组件之间流动，可以理解为“传话者”**\n",
    "\n",
    "**spider\t爬虫，负责解析response和提取Item**\n",
    "\n",
    "**downloader\t下载器，负责下载网页数据给引擎**\n",
    "\n",
    "**scheduler\t调度器，负责将url入队列，默认去掉重复的url**\n",
    "\n",
    "**item pipelines\t管道，负责处理被spider提取出来的Item数据**\n",
    "\n",
    "**下载器中间件(Downloader Middlewares)**\n",
    "\n",
    "位于Scrapy引擎和下载器之间的框架，主要是处理Scrapy引擎与下载器之间的请求及响应。\n",
    "\n",
    "爬虫中间件(Spider Middlewares)\n",
    "\n",
    "介于Scrapy引擎和爬虫之间的框架，主要工作是处理蜘蛛的响应输入和请求输出。\n",
    "\n",
    "调度中间件(Scheduler Middewares)\n",
    "\n",
    "介于Scrapy引擎和调度之间的中间件，从Scrapy引擎发送到调度的请求和响应。\n",
    "\n",
    "爬取流程：\n",
    "\n",
    "1、从spider中获取到初始url给引擎，告诉引擎帮我给调度器；\n",
    "\n",
    "2、引擎将初始url给调度器，调度器安排入队列；\n",
    "\n",
    "3、调度器告诉引擎已经安排好，并把url给引擎，告诉引擎，给下载器进行下载；\n",
    "\n",
    "4、引擎将url给下载器，下载器下载页面源码；\n",
    "\n",
    "5、下载器告诉引擎已经下载好了，并把页面源码response给到引擎；\n",
    "\n",
    "6、引擎拿着response给到spider，spider解析数据、提取数据；\n",
    "\n",
    "7、spider将提取到的数据给到引擎，告诉引擎，帮我把新的url给到调度器入队列，把信息给到Item Pipelines进行保存；\n",
    "\n",
    "8、Item Pipelines将提取到的数据保存，保存好后告诉引擎，可以进行下一个url的提取了；\n",
    "\n",
    "9、循环3-8步，直到调度器中没有url，关闭网站（若url下载失败了，会返回重新下载）。\n",
    "\n",
    "Anaconda Prompt 命令行安装，使用下面的命令："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% raw\n"
    }
   },
   "source": [
    "conda install scrapy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "项目结构：\n",
    "Scrapy框架和pyspider不同，它是通过命令行来建项目的，代码编写还是需要IDE编写项目。\n",
    "\n",
    "指令：在下面的命令行输入"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "scrapy startproject test_spider"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "会在当前目录下创建一个命名为test_spider的项目\n",
    "\n",
    "例如下面是Pycharm中截图的文件夹模型：![](ScrapiFrame.JPG)\n",
    "\n",
    "**__init__.py**\t初始化文件\n",
    "\n",
    "**items.py**\t存放的是要爬取的字段。Item是保存爬取到的数据的容器。\n",
    "\n",
    "**middlewares.py**\t中间件\n",
    "\n",
    "**pipeines.py**\t管道文件，负责处理被spider提取出来的Item，例如数据持久化（将爬取的结果保存到文件/数据库中）\n",
    "\n",
    "**settings.py**\t配置文件\n",
    "\n",
    "**spiders**\tspider核心代码的目录\n",
    "\n",
    "### 创建Spider\n",
    "\n",
    "Spider 是自己定义的类， Scrapy 用它来从网页里抓取内容，并解析抓取的结果。\n",
    "\n",
    "不过这个类必须继承 Scrapy 提供的 Spider类的scrapy.Spider。\n",
    "\n",
    "还要定义 Spider 的名称和起始请求，以及怎样处理爬取的结果的方法\n",
    "\n",
    "首先使用命令行创建：(注意：这里以网址[http://quotes.toscrape.com/](http://quotes.toscrape.com/)为例)"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% raw\n"
    }
   },
   "source": [
    "cd Scrpider\n",
    "scrapy genspider quote quotes.toscrape.com\n",
    "/*scrapy genspider 爬虫名称 允许爬取的域*/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](cmdspider.JPG)\n",
    "再看架构出现了新的quote.py文件：\n",
    "![](quoteSpy.JPG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "显示内容如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import scrapy\n",
    "class QuoteSpider(scrapy.Spider):\n",
    "    name = 'quote'\n",
    "    allowed_domains = ['quotes.toscrape.com']\n",
    "    start_urls = ['http://quotes.toscrape.com/']\n",
    "    def parse(self, response):\n",
    "        pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里有3个属性——name、allowed_domains和start_urls，\n",
    "还有一个方法：parse\n",
    "\n",
    "name是每个项目唯一的名字，用来区分不同的 Spider\n",
    "\n",
    "allowed_domains是允许爬取的域名，如果初始或后续的请求链接不是这个域名下的，则\n",
    "请求链接会被过滤掉\n",
    "\n",
    "start_urls ，它包含了 Spider 在启动时爬取的url列表，初始请求是由它来定义的\n",
    "\n",
    "parse，它是 Spider 一个方法。默认情况下，被调用时 start_urls 里面的链接构成的请求完\n",
    "成下载执行后，返回的响应就会作为唯一的参数传递给这个函数。\n",
    "\n",
    "该方法负责解析返回的响应、提取数据或者进一步生成要处理的请求\n",
    "\n",
    "## 修改/创建 items.py\n",
    "\n",
    "Item 是保存爬取数据的容器，它的使用方法和字典类似。\n",
    "\n",
    "不过，相比字典， Item 多了额外的保护机制，可以避免拼写错误或者定义字段错误\n",
    "\n",
    "创建Item需要继承scrapy.Item类，并且定义类型为 scrapy.Field 字段\n",
    "\n",
    "观察目标网站，我们可以获取到到内容有text/author/tags\n",
    "\n",
    "下面我们定义Item，将item.py修改如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Define here the models for your scraped items\n",
    "#\n",
    "# See documentation in:\n",
    "# https://docs.scrapy.org/en/latest/topics/items.html\n",
    "\n",
    "import scrapy\n",
    "class ScrpiderItem(scrapy.Item):\n",
    "    # define the fields for your item here like:\n",
    "    # name = scrapy.Field()\\\n",
    "    text = scrapy.Field()\n",
    "    author = scrapy.Field()\n",
    "    tags = scrapy.Field()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里定义三个字段，实现类的名称改成QuoteItem，爬取这个会用到这个Item\n",
    "\n",
    "## 解析Response\n",
    "\n",
    "前面我们看到， parse()方法的参数 response以及start_urls 里面的链接爬取后的结果。所以在\n",
    "parse()方法中，我们可以直接对 response 变量包含的内容进行解析，比如浏览请求结果的网页源代\n",
    "码，或者进一步分析源代码内容，或者找出结果中的链接而得到下一个请求.\n",
    "\n",
    "我们可以看到网页中既有我们想要的结果，又有下一页的链接，这两部分内容我们都要进行处理。\n",
    "\n",
    "看看网页结构。每一页都有多个 class quote 的区块， 每个区块内都包括了\n",
    "text author tags 那么我们先找出所有的quote，然后提取每一个quote中的内容\n",
    "\n",
    "提取的方式可以是 CSS 选择器或 XPath 选择器。在这里我们使用CSS选择器进行选择"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import scrapy\n",
    "class QuoteSpider(scrapy.Spider):\n",
    "    name = 'quote'\n",
    "    allowed_domains = ['quotes.toscrape.com']\n",
    "    start_urls = ['http://quotes.toscrape.com/']\n",
    "    def parse(self, response):\n",
    "        quotes = response.css('.quote')\n",
    "        for quote in quotes:\n",
    "            text = quote.css('.text::text').extract_first()\n",
    "            author = quote.css('.author::text').extract_first()\n",
    "            tags = quote.css('.tags .tag::text').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里首先利用选择器选取所有的quote，并将其赋值为quotes变量，然后利用for循环对每个quote\n",
    "遍历，解析每个quote的内容\n",
    "\n",
    "对text来说，观察到它的class=text ，所以可以用.text选择器来选取，这个结果实际上是整\n",
    "个带有标签的节点，要获取它的正文内容，可以加::text来获取,这时的结果是长度为1的列表，所\n",
    "以需要 extract_first()方法来获取第1个元素\n",
    "\n",
    "而对tags来说，由于我们要获取所有的标签，所以用 extract()方法获取整个列表即可\n",
    "\n",
    "## 使用Item\n",
    "\n",
    "Item 可以理解为字典，不过在声明的时候需要实例化。\n",
    "然后依次用刚才解析的结果赋值 Item 的每一段，最后将Item返回即可\n",
    "改写Item"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import scrapy\n",
    "from spider.items import ScrpiderItem\n",
    "class QuoteSpider(scrapy.Spider):\n",
    "    name = 'quote'\n",
    "    allowed_domains = ['quotes.toscrape.com']\n",
    "    start_urls = ['http://quotes.toscrape.com/']\n",
    "    def parse(self, response):\n",
    "        quotes = response.css('.quote')\n",
    "        for quote in quotes:\n",
    "            text = quote.css('.text::text').extract_first()\n",
    "            author = quote.css('.author::text').extract_first()\n",
    "            tags = quote.css('.tags .tag::text').extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 后续Request\n",
    "\n",
    "上面的操作实现了从初始页面抓取内容。那么，下页的内容该如何抓取？\n",
    "\n",
    "这就需要我们从当前页面中找到信息来生成下个请求，\n",
    "\n",
    "然后在下个请求的页面里找到信息再构造再下个请求。这样循环往复迭代，从而实现整站的爬取\n",
    "\n",
    "可以看到网页的下面有一个next的按钮，检查相关按钮与链接实现匹配。\n",
    "\n",
    "构造请求时需要用到scrapy.Request。这里我们传递两个参数——url&callback，介绍如下："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "url：它是请求链接\n",
    "callback：它是回调函数。当指定了该回调函数的请求完成之后，获取到响应，引擎会将该\n",
    "响应作为参数传递给这个回调函数。回调函数进行解析或生成下一个请求，回调函数如上文\n",
    "parse()所示"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来我们要做的就是利用选择器得到下一页链接并生成请求，在 parse()方法后追加如下的代码：\n"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% raw\n"
    }
   },
   "source": [
    "next = response.css('.pager .next a::attr(href)').extract_first()\n",
    "url = response.urljoin(next)\n",
    "yield scrapy.Request(url=url,callback=self.parse)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "调用了 urljoin()方法， urljoin()方法可以将相对 URL 构造成 个绝对的 URL 例如，\n",
    "获取到的下一页地址是／page/2, urljoin()方法处理.\n",
    "\n",
    "得到的结果就是[http://quotes.toscrape.com/page/2](http://quotes.toscrape.com/page/2)\n",
    "\n",
    "第三句代码通过 url与callback变量构造了一个新的请求，回调函数callback依然使用parse()\n",
    "方法。这个请求完成后，响应会重新经过 parse 方法处理，得到第二页的解析结果，然后生成第二页\n",
    "的下一页，也就是第 页的请求 这样爬虫就进入了一个循环，直到最后一页\n",
    "\n",
    "大致改写如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import scrapy\n",
    "from Scrpider.items import ScrpiderItem\n",
    "\n",
    "class QuoteSpider(scrapy.Spider):\n",
    "    name = 'quote'\n",
    "    allowed_domains = ['quotes.toscrape.com']\n",
    "    start_urls = ['http://quotes.toscrape.com/']\n",
    "\n",
    "    def parse(self, response):\n",
    "        quotes = response.css('.quote')\n",
    "        for quote in quotes:\n",
    "            item = ScrpiderItem()\n",
    "            item['text'] = quote.css('.text::text').extract_first()#爬取第一个\n",
    "            item['author'] = quote.css('.author::text').extract_first()\n",
    "            item['tags'] = quote.css('.tags .tag::text').extract()#爬取所有tags标签项目\n",
    "            yield item\n",
    "        next = response.css('.pager .next a::attr(href)').extract_first()\n",
    "        url = response.urljoin(next)\n",
    "        yield scrapy.Request(url=url,callback=self.parse)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## 运行\n",
    "进入cmd目录，执行命令实现爬取"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "scrapy crawl quote"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "注意：项目需要进入目标文件夹中才能实现。\n",
    "\n",
    "项目截图：\n",
    "![](cmdScrapyQuote.JPG)\n",
    "![](cmdScrapyQuote2.JPG)\n",
    "\n",
    "## 保存到文件\n",
    "\n",
    "要完成这个任务其实不需要任何额外的代码， Scrapy 提供的 Feed Exports可以轻松将抓取结果输出\n",
    "例如，我们想将上面结果保存成 JSON 文件，可以执行如下命令："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "scrapy crawl quote -o quote.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "命令运行后，项目内多了一个 quotes.json 文件，文件包含了刚才抓取的所有内容，内容是JSON\n",
    "格式\n",
    "另外我们还可以每一个Item输出一行JSON，输出后jl为jsonline的缩写，命令如下所示："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "scrapy crawl quote -o quote.jl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输出格式还支持很多种，例如 csv/xml/pickle/marsha等，还支持 ftp、s3 等远程输出，另外\n",
    "还可以通过自定义 ItemExporter 来实现其他的输出，对应的命令："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "scrapy crawl quote -o quote.[args]\n",
    "args=csv/xml/pickle/marsha......\n",
    "scrapy crawl quote -o ftp://user:pass@ftp.example.com/path/quotes.csv\n",
    "#指令实现上传到FTP上，注意相关路径的问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过 Scrapy 提供的 Feed Exports，我们可以轻松地输出抓取结果到文件\n",
    "\n",
    "对于一些小型项目来说，这应该足够了。\n",
    "\n",
    "不过如果想要更复杂的输出，如输出到数据库等，我们可以使用 Item Pipeline 完成\n",
    "\n",
    "## Item Pipeline 的使用\n",
    "\n",
    "如果想进行更复杂的操作，如将结果保存到MongoDB数据库，或者筛选某些有用的Item，则\n",
    "们可以定义 Item Pipeline 来实现。\n",
    "\n",
    "Item pipeline的执行操作：\n",
    "\n",
    "1.清理 HTML 数据\n",
    "\n",
    "2.验证爬取数据，检查爬取字段\n",
    "\n",
    "3.查重井丢弃重复内容\n",
    "\n",
    "4.将爬取结果保存到数据库\n",
    "\n",
    "要实现 Item Pipeline很简单，只需要定义一个类并实现process_item()方法即可。启用 Item Pipeline\n",
    "后， Item Pipeline会自动调用这个方法 process_item()方法必须返回包含数据的字典或 Item 对象，\n",
    "或者抛出 Dropltem 异常\n",
    "\n",
    "process_item()有2个参数，分别是item(传参)与spider(Spider的实例)\n",
    "\n",
    "下面我们模拟实现一个Item Pipeline筛选len(text)>50的Item并且保存到MongoDB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "ename": "IndentationError",
     "evalue": "expected an indented block (<ipython-input-1-855ae7016547>, line 4)",
     "output_type": "error",
     "traceback": [
      "\u001B[1;36m  File \u001B[1;32m\"<ipython-input-1-855ae7016547>\"\u001B[1;36m, line \u001B[1;32m4\u001B[0m\n\u001B[1;33m    def process_item(self, item, spider):\u001B[0m\n\u001B[1;37m    ^\u001B[0m\n\u001B[1;31mIndentationError\u001B[0m\u001B[1;31m:\u001B[0m expected an indented block\n"
     ]
    }
   ],
   "source": [
    "#file: pipelines.py\n",
    "from scrapy.exceptions import DropItem\n",
    "class ScrpiderPipeline:\n",
    "    def __init__(self):\n",
    "        self.limit = 50\n",
    "    def process_item(self, item, spider):\n",
    "        if item['text']:\n",
    "            if len(item['text']) >self.limit:\n",
    "                item['text'] = item['text'][0:self.limit].rstrip() +\"...\"\n",
    "            return item\n",
    "        else:\n",
    "            return DropItem('Missing TextItem...')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "这段代码在构造方法里定义了限制长度为50，\n",
    "实现了 process_item()方法，其数是item\n",
    "和spider。首先该方法判断item['text'] 属性是否存在，\n",
    "如果不存在，则抛出DropItem异常\n",
    "\n",
    "下面我们需要把数据存入MongoDB中去，需要定义另外一个Pipeline\n",
    "\n",
    "同样在pipelines.py文件中实现类MongoPipeine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import pymongo\n",
    "class MongoPipeine(object):\n",
    "    def __init__(self,mongo_uri,mongo_db):\n",
    "        self.mongo_uri = mongo_uri\n",
    "        self.mongo_db = mongo_db\n",
    "    @classmethod\n",
    "    def from_crawler(cls,crawler):\n",
    "        return cls(\n",
    "            mongo_uri=crawler.settings.get('MONGO_URI'),\n",
    "            mongo_db=crawler.settings.get('MONGO_DB')\n",
    "        )\n",
    "\n",
    "    def open_spider(self,spider):\n",
    "        self.client = pymongo.MongoClient(self.mongo_uri)\n",
    "        self.db = self.client[self.mongo_db]\n",
    "\n",
    "    def process_item(self,item,spider):\n",
    "        name = item.__class__.__name__\n",
    "        self.db[name].insert(dict(item))\n",
    "        return item\n",
    "\n",
    "    def close_spider(self,spider):\n",
    "        self.client.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "定义好 TextPipeline MongoPipeline 这两个类后，我们需要在 settings.py 中使用它们 MongoDB\n",
    "的连接信息还需要定义。\n",
    "\n",
    "我们在settings.py中加入如下内容："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "ITEM_PIPELINES={\n",
    "    'Scrpider.pipelines.TextPipeline':300,\n",
    "    'Scrpider.pipelines.MongoPipeine':400,\n",
    "}\n",
    "MONGO_URI='localhost'\n",
    "MONGO_DB='Scrpider'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "赋值ITEM PIPELINES字典，键名Pipeline类名称，键值是调用优先级，是一个数字，数字越\n",
    "小则对应的Pipeline越先被调用\n",
    "\n",
    "再重新执行爬取，命令如下："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "scrapy crawl quote"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "爬取结束后会在MongoDB中创建一个Scrpider的数据库\n",
    "\n",
    "这里使用了MongoDB Compass查看相关的数据库，看到相关的数据已经内容：\n",
    "![](MongoScrapy.JPG)\n",
    "\n",
    "长的text已经被处理并追加了省略号，短的text保持不变，author与tags也都相应保存\n",
    "\n",
    "## Scrapy Selector选择器用法\n",
    "\n",
    "我们之前介绍了利用 BeautifulSoup pyquery以及正则表达式来提取网页数据，这确实非常方便\n",
    "\n",
    "Scrapy 还提供了自己的数据提取方法，即 Selector（选择器）Selector 是基于lxml来构建的，支持\n",
    "XPath选择器、css选择器以及正则表达式，功能全面，解析速度和准确度非常高。\n",
    "\n",
    "下面介绍Selector的用法：\n",
    "\n",
    "Selector 是一个可以独立使用的模块。\n",
    "\n",
    "我们可以直接利用Selector这个类来构建一个选择器对象，\n",
    "然后调用它的相关方法xpath()/css()等来提取数据\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello World\n"
     ]
    }
   ],
   "source": [
    "from scrapy import Selector\n",
    "target = '<html><head><title>Hello World</title></head></html>'\n",
    "selector = Selector(text=target)\n",
    "print(selector.xpath('//title/text()').extract_first())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这里没有在Scrapy框架中运行,而是把Scrapy中的Selector单独拿出来使用了，\n",
    "\n",
    "与BeautifulSoup等库类似， Selector 其实也是强大的网页解析库。如果方便的话，我们也可以在其他项目中直接使用Selector来提取数据\n",
    "\n",
    "## Scrapy Shell\n",
    "\n",
    "url: [http://doc.scrapy.org/en/latest/_static/selectors-sample1.html](http://doc.scrapy.org/en/latest/_static/selectors-sample1.html)\n",
    "\n",
    "在命令行输入下面命令："
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "实现进入shell模式。这个过程其实是，Scrapy发起了请求，请求的URL就是刚才\n",
    "命令行下输入的URL，然后把可操作的变量传递给我们，如request、response等\n",
    "\n",
    "![](Scrapyshell.JPG)\n",
    "\n",
    "界面类似IPython交互界面,下面的项目会在这里进行实践：\n",
    "\n",
    "### 项目1：xpath解析提取：\n",
    "![](Scrapyshellxpath.JPG)\n",
    "\n",
    "打印结果的形式是 Selector 组成的列表，其实它是Selectorlist类型.SelectorList与Selector\n",
    "可以继续调用 xpath()和css()等方法来进一步提取数据，例如下面提取图片\n",
    "\n",
    "![](Scrapyshellxpath2.JPG)\n",
    "值得注意的是，选择器的最前方加'.'，这代表提取元素内部的数据，如果没有加点，则代\n",
    "从根节点开始提取。\n",
    "\n",
    "此处我们用了./img的提取方式，则代表从节点里进行提取；如果此处我们用\n",
    "//img，则还是从html节点里进行提取\n",
    "\n",
    "但是现在获取的内容是 Selector 或者 Se lector list 类型，并不是真正的文本内容 那么具体的\n",
    "内容怎么提取呢？我们调用extract()方法。\n",
    "\n",
    "![](Scrapyshellxpath3.JPG)\n",
    "\n",
    "有时候我们会调用extract_first()方法，获取第一个相关文本信息，可以有效地解决问题\n",
    "\n",
    "另外也可以使用extract_first()方法设置一个默认值参数，\n",
    "这样当XPath规则提取不到内容时会直接使用默认值"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% raw\n"
    }
   },
   "source": [
    "response.xpath('//a[@href=\"image1\"]/text()').extract_first('Default')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "结果实验：\n",
    "![](Scrapyshellxpath4.JPG)\n",
    "\n",
    "### 项目2：CSS解析提取\n",
    "\n",
    "同xpath,scrapy的.css()方法可以实现基本的用CSS选择器选择对应的元素"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "pycharm": {
     "name": "#%% raw\n"
    }
   },
   "source": [
    "response.css('a')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "得到结果：![](Scrapcss.JPG)\n",
    "\n",
    "我们可以随意自由的地嵌套XPath和CSS结合使用，例如："
   ]
  },
  {
   "cell_type": "raw",
   "source": [
    "response.xpath('//a').css('img').xpath('@src').extract_first()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "返回的是所有img节点的src属性\n",
    "\n",
    "## 正则匹配\n",
    "\n",
    "Scrapy也支持正则匹配技术，同样使用response.re()方法得到对应的正则表达式"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "raw",
   "source": [
    "response.xpath('//a/text()').re('Name:\\s(.*)')"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "可以匹配出对应的分组，如果存在多分组，则会按序输出。\n",
    "\n",
    "如果多输出要提取第一个元素，可以使用：re_first()方法实现。\n",
    "\n",
    "但是response对象不能直接使用re()和re_first()方法，可以先进行xpath或者css匹配后再进行re匹配。否则会报错AttributeError找不到re属性\n",
    "\n",
    "## 1.Spider用法\n",
    "\n",
    "实现 Scrapy 爬虫项目时，最核心的类便是 Spider 类了，它定义了如何爬取某个网站的流程和\n",
    "解析方式。\n",
    "\n",
    "简单来讲，Spider 要做的事就是如下两件：\n",
    "\n",
    "1.定义爬取网站的动作； 2.分析爬取下来的网页\n",
    "\n",
    "对于 Spider 类来说，整个爬取循环过程如下所述：\n",
    "\n",
    "以初始的URL初始化Request，并设置回调函数。当该Request成功请求并返回时，Response生成并作为参数传给该回调函数。\n",
    "\n",
    "在回调函数内分析返回的网页内容。返回结果有两种形式。一种是解析到的有效结果返回字典或Item对象，它们可以经过处理后（或直接）保存。另一种是解析得到下一个（如下一页）链接，可以利用此链接构造Request并设置新的回调函数，返回Request等待后续调度。\n",
    "\n",
    "如果返回的是字典或Item对象，我们可通过Feed Exports等组件将返回结果存入到文件。如果设置了Pipeline的话，我们可以使用Pipeline处理（如过滤、修正等）并保存。\n",
    "\n",
    "如果返回的是Reqeust，那么Request执行成功得到Response之后，Response会被传递给Request中定义的回调函数，在回调函数中我们可以再次使用选择器来分析新得到的网页内容，并根据分析的数据生成Item。\n",
    "\n",
    "通过以上几步循环往复进行，我们完成了站点的爬取。\n",
    "\n",
    "## 2.Spider类分析\n",
    "\n",
    "在上一节的例子中，我们定义的Spider是继承自scrapy.spiders.Spider。scrapy.spiders.Spider这个类是最简单最基本的Spider类，其他Spider必须继承这个类。还有后面一些特殊Spider类也都是继承自它。\n",
    "\n",
    "scrapy.spiders.Spider这个类提供了start_requests()方法的默认实现，读取并请求start_urls属性，并根据返回的结果调用parse()方法解析结果。它还有如下一些基础属性：\n",
    "\n",
    "name。爬虫名称，是定义Spider名字的字符串。Spider的名字定义了Scrapy如何定位并初始化Spider，它必须是唯一的。不过我们可以生成多个相同的Spider实例，数量没有限制。name是Spider最重要的属性。如果Spider爬取单个网站，一个常见的做法是以该网站的域名名称来命名Spider。例如，Spider爬取mywebsite.com，该Spider通常会被命名为mywebsite。\n",
    "\n",
    "allowed_domains。允许爬取的域名，是可选配置，不在此范围的链接不会被跟进爬取。\n",
    "\n",
    "start_urls。它是起始URL列表，当我们没有实现start_requests()方法时，默认会从这个列表开始抓取。\n",
    "\n",
    "custom_settings。它是一个字典，是专属于本Spider的配置，此设置会覆盖项目全局的设置。此设置必须在初始化前被更新，必须定义成类变量。\n",
    "\n",
    "crawler。它是由from_crawler()方法设置的，代表的是本Spider类对应的Crawler对象。Crawler对象包含了很多项目组件，利用它我们可以获取项目的一些配置信息，如最常见的获取项目的设置信息，即Settings。\n",
    "\n",
    "settings。它是一个Settings对象，利用它我们可以直接获取项目的全局设置变量。\n",
    "\n",
    "除了基础属性，Spider还有一些常用的方法：\n",
    "\n",
    "start_requests()。此方法用于生成初始请求，它必须返回一个可迭代对象。此方法会默认使用start_urls里面的URL来构造Request，而且Request是GET请求方式。如果我们想在启动时以POST方式访问某个站点，可以直接重写这个方法，发送POST请求时使用FormRequest即可。\n",
    "\n",
    "parse()。当Response没有指定回调函数时，该方法会默认被调用。它负责处理Response，处理返回结果，并从中提取出想要的数据和下一步的请求，然后返回。该方法需要返回一个包含Request或Item的可迭代对象。\n",
    "\n",
    "closed()。当Spider关闭时，该方法会被调用，在这里一般会定义释放资源的一些操作或其他收尾操作。\n",
    "\n",
    "## Downloader Middleware\n",
    "\n",
    "Downloader Middleware即下载中间件，它是处于Scrapy的Request和Response之间的处理模块。\n",
    "\n",
    "Scheduler从队列中拿出一个Request发送给Downloader执行下载，这个过程会经过Downloader Middleware的处理。\n",
    "\n",
    "另外，当Downloader将Request下载完成得到Response返回给Spider时会再次经过Downloader Middleware处理。\n",
    "\n",
    "Downloader Middleware 在整个架构中起作用的位置是以下两个：\n",
    "\n",
    "**1.Scheduler 调度出队列的 Request 发送给 Downloader 下载之前，也就是我们可以在 request\n",
    "执行下载之前对其进行修改**\n",
    "\n",
    "**2.在下载后生成Response发送给 Spider之前，也就是我们可以在生成 Response 被 Spider 解析之前对其进行修改**\n",
    "\n",
    "修改User-Agent、处理重定向、设置代理，失败重试、设置Cookies等功能都需要借助它来实现。\n",
    "\n",
    "## 1.使用说明\n",
    "\n",
    "Scrapy其实已经提供了许多Downloader Middleware，比如负责失败重试，自动重定向等功能的Middleware，它们被DOWNLOADER_MIDDLEWARES_BASE变量所定义。\n",
    "\n",
    "如果自己定义的下载中间件要添加到项目里，DOWNLOADER_MIDDLEWARE_BASE变量不能直接修改，Scrapy提供了另外一个设置变量DOWNLOADER_MIDDLEWARES，我们直接修改这个变量就可以添加自己定义的下载中间件，以及禁用DOWNLOADER_MIDDLEWARE_BASE里面定义的DOWNLOADER_MIDDLEWARE。\n",
    "\n",
    "## 核心方法\n",
    "\n",
    "核心方法有如下三个：\n",
    "\n",
    "### process_request(request, spider).\n",
    "### process_response(request,response,spider)\n",
    "### process_exception(request,exception,spider)\n",
    "\n",
    "我们只需要实现至少一个方法就可以定义一个下载中间件。\n",
    "\n",
    "### A.process_request(request, spider)\n",
    "\n",
    "在 Request被Scrapy引擎调度给下载器之前，我们可以使用此方法对Request进行处理。方法的返回值必须为 None、Response对象或者Request对象,或者抛出IgnoreRequest异常\n",
    "参数：\n",
    "1.request:Request对象，即被处理的Request。\n",
    "2.spider: Spider对象，即此Request对应的Spider。\n",
    "\n",
    "当返回是None时，正常执行，这个过程就是修改Request的过程。\n",
    "\n",
    "当返回为Response对象时，更低优先级的下载中间件的process_request()方法和process_exception()方法就不会被继续调用，每个下载中间件的process_response()方法被依次调用，调用完毕之后，直接将Response对象发送给Spider处理。\n",
    "\n",
    "当返回为Request对象时，更低优先级的下载中间件的process_request()方法会停止执行，这个Request会被重新方到调度队列里等待调用，如果被Scheduler调度了，那么所有的下载中间件的process_request()方法会被重新按照顺序执行。\n",
    "\n",
    "当抛出异常时，则所有的下载中间件的process_exception()方法会依次执行。如果没有一个方法处理这个异常，那么Request的erroback()方法就会回调，还没有处理该异常的方法，那么它就会被忽略。\n",
    "\n",
    "## B.process_response(request,response,spider)\n",
    "\n",
    "在下载器执行下载之后，会得到Response，在引擎将Response发送给Spider解析之前，我们可以使用此方法对Response进行处理。方法的返回值必须为 Response对象或者Request对象,或者抛出IgnoreRequest异常\n",
    "\n",
    "参数：\n",
    "\n",
    "1.request:Request对象，即此Response对应的Request\n",
    "\n",
    "2.response:Response对象，即此被处理的Response\n",
    "\n",
    "3.spider:Spider对象，即此Response对应的Spider\n",
    "\n",
    "当返回为Request对象时，更低优先级的下载中间件的process_response()方法不会继续调用，该Request对象被重新放入调度队列等待调度。然后该Request会被process_request()方法顺次处理。\n",
    "\n",
    "当返回为Response对象时，更低优先级的下载中间件的process_response()方法会继续调用，继续对该Response进行处理。\n",
    "\n",
    "当返回异常时，同上。\n",
    "\n",
    "## C.process_response(request,response,spider)\n",
    "\n",
    "当返回为Response对象时，更低优先级的下载中间件的process_exception()方法不会继续调用，每个 Downloader Middleware的process_response()方法转而被依次调用。\n",
    "\n",
    "当返回为Request对象时，更低优先级的下载中间件的process_exception()方法不会继续调用，该 Request 对象会重新放到调度队列里面等待被调度，它相当于一个全新的\n",
    "Request 。然后，该 Request 又会被 process_request()方法顺次处理\n",
    "\n",
    "当返回None时，更低优先级的 Downloader Middleware process_exception()会被继续顺次调用，直到所有的方法都被调度完毕。\n",
    "\n",
    "## Spider Middleware\n",
    "\n",
    "![](ScrapyFrame.JPG)\n",
    "当Downloader生成Response之后，Response会被发送给Spider，在发送给Spider之前，Response会首先经过Spider Middleware处理，当Spider处理生成Item和Request之后，Item和Request还会经过Spider Middleware的处理。\n",
    "\n",
    "Spider Middleware 有如下三个作用：\n",
    "\n",
    "我们可以在Downloader生成的Response发送给Spider之前，也就是在Response发送给Spider之前对Response进行处理。\n",
    "\n",
    "我们可以在Spider生成的Request发送给Scheduler之前，也就是在Request发送给Scheduler之前对Request进行处理。\n",
    "\n",
    "我们可以在Spider生成的Item发送给Item Pipeline之前，也就是在Item发送给Item Pipeline之前对Item进行处理。\n",
    "\n",
    "### I.Introduction\n",
    "需要说明的是，Scrapy其实已经提供了许多Spider Middleware，它们被SPIDER_MIDDLEWARES_BASE这个变量所定义。\n",
    "\n",
    "其内容如下："
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "raw",
   "source": [
    "{\n",
    "    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,\n",
    "    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,\n",
    "    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,\n",
    "    'scrapy.spidermiddlewares.urllength.UrllengthMiddleware': 800,\n",
    "    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,\n",
    "}"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "和Downloader Middleware一样，Spider Middleware首先加入到SPIDER_MIDDLEWARES设置中，该设置会和Scrapy中SPIDER_MIDDLEWARES_BASE定义的Spider Middleware合并。然后根据键值的数字优先级排序，得到一个有序列表。第一个Middleware是最靠近引擎的，最后一个Middleware是最靠近Spider的。\n",
    "\n",
    "## II.核心方法：\n",
    "\n",
    "Scrapy内置的Spider Middleware为Scrapy提供了基础的功能。如果我们想要扩展其功能，只需要实现某几个方法即可。\n",
    "\n",
    "每个Spider Middleware都定义了以下一个或多个方法的类，核心方法有如下4个："
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "raw",
   "source": [
    "process_spider_input(response, spider)\n",
    "process_spider_output(response, result, spider)\n",
    "process_spider_exception(response, exception, spider)\n",
    "process_start_requests(start_requests, spider)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "只需要实现其中一个方法就可以定义一个Spider Middleware\n",
    "\n",
    "## III.process_spider_input(response, spider)\n",
    "\n",
    "当Response被Spider Middleware处理时，process_spider_input()方法被调用。\n",
    "\n",
    "process_spider_input()方法的参数有如下两个：\n",
    "\n",
    "response，是Response对象，即被处理的Response;spider，是Spider对象，即该Response对应的Spider。\n",
    "\n",
    "process_spider_input()应该返回None或者抛出一个异常。\n",
    "\n",
    "如果它返回None，Scrapy将会继续处理该Response，调用所有其他的Spider Middleware，直到Spider处理该Response。\n",
    "\n",
    "如果它抛出一个异常，Scrapy将不会调用任何其他Spider Middleware的process_spider_input()方法，而调用Request的errback()方法。errback的输出将会被重新输入到中间件中，使用process_spider_output()方法来处理，当其抛出异常时则调用process_spider_exception()来处理。\n",
    "\n",
    "## IV.process_spider_output(response, result, spider)\n",
    "\n",
    "当Spider处理Response返回结果时，process_spider_output()方法被调用。\n",
    "\n",
    "process_spider_output()方法的参数有如下三个：\n",
    "\n",
    "response，是Response对象，即生成该输出的Response。\n",
    "\n",
    "result，包含Request或Item对象的可迭代对象，即Spider返回的结果。\n",
    "\n",
    "spider，是Spider对象，即其结果对应的Spider。\n",
    "\n",
    "process_spider_output()必须返回包含Request或Item对象的可迭代对象。\n",
    "\n",
    "## V.process_spider_exception(response, exception, spider)\n",
    "当Spider或Spider Middleware的process_spider_input()方法抛出异常时，process_spider_exception()方法被调用。\n",
    "\n",
    "process_spider_exception()方法的参数有如下三个：\n",
    "\n",
    "response，是Response对象，即异常被抛出时被处理的Response。\n",
    "\n",
    "exception，是Exception对象，即被抛出的异常。\n",
    "\n",
    "spider，是Spider对象，即抛出该异常的Spider。\n",
    "\n",
    "process_spider_exception()必须要么返回None，要么返回一个包含Response或Item对象的可迭代对象。\n",
    "\n",
    "如果它返问None，Scrapy将继续处理该异常，调用其他Spider Middleware中的process_spider_exception()方法，直到所有Spider Middleware都被调用。\n",
    "\n",
    "如果它返回一个可迭代对象，则其他Spider Middleware的process_spider_output()方法被调用，其他的process_spider_exception()不会被调用。\n",
    "\n",
    "## VI.process_start_requests(start_requests, spider)\n",
    "\n",
    "process_start_requests()方法以Spider启动的Request为参数被调用，执行的过程类似于process_spider_output()，只不过它没有相关联的Response，并且必须返回Request。\n",
    "\n",
    "process_start_requests()方法的参数有如下两个：\n",
    "\n",
    "start_requests，是包含Request的可迭代对象，即Start Requests。\n",
    "\n",
    "spider，是Spider对象，即Start Requests所属的Spider。\n",
    "\n",
    "process_start_requests()必须返回另一个包含Request对象的可迭代对象。\n",
    "\n",
    "## Item Pipeline\n",
    "\n",
    "当Item在Spider中被收集之后，它将会被传递到Item Pipeline，这些Item Pipeline组件按定义的顺序处理Item。\n",
    "\n",
    "item pipeline的主要作用：\n",
    "\n",
    "1.清理html数据\n",
    "\n",
    "2.验证爬取的数据\n",
    "\n",
    "3.去重并丢弃\n",
    "\n",
    "4.将爬取的结果保存到数据库中或文件中\n",
    "\n",
    "每个Item Pipeline组件是一个独立的python类，其中必须实现process_item(self,item,spider)方法。\n",
    "\n",
    "其他方法：\n",
    "\n",
    "1、open_spider(spider)就是打开spider时候调用的\n",
    "\n",
    "2、close_spider(spider)关闭spider时候调用\n",
    "\n",
    "3、from_crawler(cls, crawler)一般用来从settings.py中获取常量的\n",
    "\n",
    "4、process_item(item, spider)是必须实现的,别的都是选用的\n",
    "\n",
    "## 常用方法介绍：\n",
    "\n",
    "1、process_item(item, spider)参数介绍\n",
    "\n",
    "item是要处理的item对象\n",
    "\n",
    "spider当前要处理的spider对象\n",
    "\n",
    "2、process_item(item, spider)返回值\n",
    "\n",
    "返回item就会继续给优先级低的item pipeline二次处理\n",
    "\n",
    "如果直接抛出DropItem的异常就直接丢弃该item\n",
    "\n",
    "3、open_spider(spider)是在开启spider的时候触发的,常用于初始化操作(常见开启数据库连接,打开文件)\n",
    "\n",
    "4、close_spider(spider)是在关闭spider的时候触发的,常用于关闭数据库连接\n",
    "\n",
    "5、from_crawler(cls, crawler)是一个类方法(需要使用@classmethod装饰器标识),常用于从settings.py获取配置信息\n",
    "\n",
    "## 项目(实验)介绍：\n",
    "以爬取360摄影美图，分别实现MongoDB存储/MySQL存储/图片存储\n",
    "\n",
    "结果参考文件Images360\n",
    "\n",
    "## 对接Selenium\n",
    "\n",
    "Scrapy对接Selenium，使用PhantomJS实现，项目"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "raw",
   "source": [
    "scrapy startproject scrapyseleniumtest\n",
    "scrapy genspider taobao taobao.com"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "注意在settings.py文件下选择设置\n",
    "\n",
    "**ROBOTSTXT_OBEY=False**\n",
    "\n",
    "对接的具体项目在downloader middleware里面实现\n",
    "\n",
    "项目文件：scrapyseleniumtest,其中的middlewares配置如下："
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "\n",
    "from selenium import webdriver\n",
    "from selenium.common.exceptions import TimeoutException\n",
    "from selenium.webdriver.common.by import By\n",
    "from selenium.webdriver.support.ui import WebDriverWait\n",
    "from selenium.webdriver.support import expected_conditions as EC\n",
    "from scrapy.http import HtmlResponse\n",
    "from logging import getLogger\n",
    "\n",
    "\n",
    "class SeleniumMiddleware():\n",
    "    def __init__(self, timeout=None, service_args=[]):\n",
    "        self.logger = getLogger(__name__)\n",
    "        self.timeout = timeout\n",
    "        self.browser = webdriver.PhantomJS(service_args=service_args)\n",
    "        self.browser.set_window_size(1400, 700)\n",
    "        self.browser.set_page_load_timeout(self.timeout)\n",
    "        self.wait = WebDriverWait(self.browser, self.timeout)\n",
    "\n",
    "    def __del__(self):\n",
    "        self.browser.close()\n",
    "\n",
    "    def process_request(self, request, spider):\n",
    "        \"\"\"\n",
    "        用PhantomJS抓取页面\n",
    "        :param request: Request对象\n",
    "        :param spider: Spider对象\n",
    "        :return: HtmlResponse\n",
    "        \"\"\"\n",
    "        self.logger.debug('PhantomJS is Starting')\n",
    "        page = request.meta.get('page', 1)\n",
    "        try:\n",
    "            self.browser.get(request.url)\n",
    "            if page > 1:\n",
    "                input = self.wait.until(\n",
    "                    EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))\n",
    "                submit = self.wait.until(\n",
    "                    EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))\n",
    "                input.clear()\n",
    "                input.send_keys(page)\n",
    "                submit.click()\n",
    "            self.wait.until(\n",
    "                EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))\n",
    "            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))\n",
    "            return HtmlResponse(url=request.url, body=self.browser.page_source, request=request, encoding='utf-8',\n",
    "                                status=200)\n",
    "        except TimeoutException:\n",
    "            return HtmlResponse(url=request.url, status=500, request=request)\n",
    "\n",
    "    @classmethod\n",
    "    def from_crawler(cls, crawler):\n",
    "        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),\n",
    "                   service_args=crawler.settings.get('PHANTOMJS_SERVICE_ARGS'))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Scrapy 通用爬虫\n",
    "\n",
    "背景：通过 Scrapy，我们可以轻松地完成一个站点爬虫的编写。但如果抓取的站点量非常大，比如爬各大媒体的新闻信息，多个Spider则可能包含很多重复代码\n",
    "\n",
    "如果我们将各个站点的 Spider 的公共部分保留下来，不同的部分提取出来作为单独的配置，如爬取规则、页面解析方式等抽离出来做成一个配置文件，那么我们在新增一个爬虫的时候，只需要实现这些网站的爬取规则和提取规则即可\n",
    "\n",
    "这里我们使用到CrawlSpider通用爬虫，继承父类的Spider类。参考官方文档：\n",
    "\n",
    "[http://scrapy.readthedocs.io/en/latest/topics/spiders.html#crawlspider](http://scrapy.readthedocs.io/en/latest/topics/spiders.html#crawlspider)\n",
    "\n",
    "在这里我们可以定义爬取的规则，用专门的数据结构Rule表示\n",
    "\n",
    "CrawlSpider提供了rules属性和parse_start_url()方法\n",
    "\n",
    "parse_start_url()方法，当start_urls里对应的Request得到 Response\n",
    "时，该方法被调用，它会分析 Response 并必须返回 Item 对象或者 Request 对象.\n",
    "\n",
    "rules的定义比较重要"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "raw",
   "source": [
    "class scrapy.contrib.spiders.Rule(link_extractor,callback=None,cb_kwargs=None,follow=None,process_links=None, process_request=None)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "下面说明上述参数：\n",
    "\n",
    "link_extractor 为LinkExtractor，用于定义需要提取的链接\n",
    "\n",
    "callback参数：当link_extractor获取到链接时参数所指定的值作为回调函数\n",
    "\n",
    "callback参数使用注意：\n",
    "当编写爬虫规则时，请避免使用parse作为回调函数。于CrawlSpider使用parse方法来实现其逻辑，如果您覆盖了parse方法，crawlspider将会运行失败\n",
    "\n",
    "follow：指定了根据该规则从response提取的链接是否需要跟进。当callback为None,默认值为True，否则默认False\n",
    "\n",
    "process_links：主要用来过滤由link_extractor获取到的链接\n",
    "\n",
    "process_request：主要用来过滤在rule中提取到的request\n",
    "\n",
    "cb_kwargs：字典，包含传递给回调函数的参数"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "LinkExtractors简介："
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "raw",
   "source": [
    "class scrapy.linkextractors.LinkExtractor(\n",
    "    allow = (),\n",
    "    deny = (),\n",
    "    allow_domains = (),\n",
    "    deny_domains = (),\n",
    "    deny_extensions = None,\n",
    "    restrict_xpaths = (),\n",
    "    tags = ('a','area'),\n",
    "    attrs = ('href'),\n",
    "    canonicalize = True,\n",
    "    unique = True,\n",
    "    process_value = None\n",
    ")"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "allow：满足括号中“正则表达式”的值会被提取，如果为空，则全部匹配。\n",
    "\n",
    "deny：与这个正则表达式(或正则表达式列表)不匹配的URL一定不提取。\n",
    "\n",
    "allow_domains：会被提取的链接的domains。白名单\n",
    "\n",
    "deny_domains：一定不会被提取链接的domains。黑名单\n",
    "\n",
    "restrict_xpaths：使用xpath表达式，和allow共同作用过滤链接(只选到节点，不选到属性)\n",
    "\n",
    "## Item Loader模块"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "from scrapy.loader import ItemLoader\n",
    "#通过item loader加载item\n",
    "item_loader = ItemLoader(item=JoBoleArticleItem(),response=response)\n",
    "#针对直接取值的情况\n",
    "item_loader.add_value('front_image_url','front_image_url')\n",
    "#针对css选择器\n",
    "item_loader.add_css('title','div.entry-header h1::text')\n",
    "item_loader.add_css('create_date','p.entry-meta-hide-on-mobile::text')\n",
    "item_loader.add_css('praise_num','#112547votetotal::text')\n",
    "#针对xpath的情况\n",
    "item_loader.add_xpath('content','//*[@id=\"post-112239\"]/div[3]/div[3]/p[1]')\n",
    "#把结果返回给item对象\n",
    "article_item = item_loader.load_item()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [],
   "metadata": {
    "collapsed": false
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}